[None][chore] Fix lock_infra_error #15213
…formance tests and configurations
- Updated model paths to include nemotron_3_ultra_550b_nvfp4 in HF_MODEL_PATH.
- Added configuration settings for nemotron_3_ultra_550b_nvfp4 in pytorch_model_config.py.
- Included new performance test cases for nemotron_3_ultra_550b_nvfp4 in test_perf.py and updated llm_perf_core.yml.
- Cleaned up legacy model name handling in test_perf.py.
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
…ts for nemotron and llama models
- Reintroduced performance tests for nemotron_nano_12b_v2 and qwen3.5_27b models with various configurations.
- Added performance tests for llama_v3.3_nemotron_super_49b with multiple input/output lengths and GPU configurations.
- Ensured comprehensive coverage of performance benchmarks in the llm_perf_core.yml file.
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
- Removed redundant test cases for llama_v3.1_nemotron_ultra_253b and adjusted the configuration for qwen3.5_122b_a10b.
- Added back performance tests for llama_v3.1_nemotron_ultra_253b with various input/output lengths and GPU configurations.
- Updated comments for clarity on the test cases included.
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Addresses CodeRabbit review: --log_level=info is a static literal and does not need an f-string prefix (ruff F541).
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged.
Signed-off-by: [Your Name] <your.email@example.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
📝 Walkthrough
Refactors the HuggingFace cache lock acquisition to classify infrastructure errors and implement fallback logic. Adds a helper function to identify timeout and permission-related lock failures, then updates the context manager to acquire locks explicitly and conditionally retry with a tempdir-based lock when primary acquisition fails for infrastructure reasons.
Changes
Lock infrastructure improvement
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/model_config.py`:
- Around line 72-73: The helper _is_lock_infra_error currently treats
filelock.Timeout as an infrastructure failure; update that function to stop
classifying filelock.Timeout as an infra error (it is an acquisition timeout,
not broken lock infra). Locate _is_lock_infra_error and remove filelock.Timeout
from the isinstance check so only true infrastructure errors (e.g.,
PermissionError or other genuine file-locking exceptions you want to keep)
trigger the tempdir/no-lock fallback paths; ensure PermissionError handling
remains unchanged.
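The narrowing requested in this comment can be sketched as follows. This is a minimal illustration, not the code in `tensorrt_llm/_torch/model_config.py`: the `Timeout` class is a local stand-in for `filelock.Timeout` (modeled as a `TimeoutError` subclass so the sketch needs no third-party import), and the function name `is_lock_infra_error` is hypothetical.

```python
class Timeout(TimeoutError):
    """Local stand-in for filelock.Timeout so the sketch needs no third-party import."""


def is_lock_infra_error(exc: BaseException) -> bool:
    """Return True only when the lock location itself looks unusable."""
    if isinstance(exc, Timeout):
        # Another process holds the lock: contention, not broken infrastructure.
        return False
    # PermissionError is raised when the lock file cannot be created or opened.
    return isinstance(exc, PermissionError)
```

With this shape, a timeout is reported as contention, while a `PermissionError` still selects the fallback path the comment asks to keep unchanged.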
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7b0c3216-d3ec-4a6d-a146-9bb344730ee7
📒 Files selected for processing (1)
tensorrt_llm/_torch/model_config.py
Updated the logic in the _is_lock_infra_error function to better differentiate between lock contention and broken infrastructure. Enhanced the config_file_lock context manager to log warnings appropriately when lock acquisition fails, ensuring clearer error handling and fallback behavior. Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
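A context manager of the kind this commit describes might look roughly like the sketch below. It is an assumption-laden illustration, not the merged code: `Timeout` is a local stand-in for `filelock.Timeout`, `DemoLock`, `is_fallback_eligible`, and `guarded_lock` are hypothetical names, and treating timeouts as fallback-eligible follows the PR description rather than the review comment.

```python
import contextlib
import errno
import logging

logger = logging.getLogger(__name__)


class Timeout(TimeoutError):
    """Local stand-in for filelock.Timeout (assumption: an OSError subclass)."""


class DemoLock:
    """Hypothetical lock with the acquire()/release() shape used below."""

    def __init__(self, fail_with=None):
        self.fail_with = fail_with
        self.held = False

    def acquire(self):
        if self.fail_with is not None:
            raise self.fail_with
        self.held = True

    def release(self):
        self.held = False


def is_fallback_eligible(exc):
    """Hypothetical classifier: timeouts, permission errors, NFS lock errnos."""
    if isinstance(exc, (Timeout, PermissionError)):
        return True
    return isinstance(exc, OSError) and exc.errno in {errno.ENOLCK, errno.ESTALE}


@contextlib.contextmanager
def guarded_lock(primary, fallback):
    """Hold `primary`, or `fallback` if acquiring `primary` fails for an eligible reason."""
    held = primary
    try:
        primary.acquire()  # the only call guarded by this try block
    except OSError as exc:
        if not is_fallback_eligible(exc):
            raise  # unrelated OSError: propagate unchanged
        logger.warning("Primary lock unavailable (%r); using fallback lock", exc)
        fallback.acquire()
        held = fallback
    # The single yield sits outside the acquire guard, so exceptions raised by
    # the caller's body are never mistaken for lock failures.
    try:
        yield held
    finally:
        held.release()
```

For example, `with guarded_lock(DemoLock(fail_with=Timeout()), DemoLock()) as held:` runs the body under the fallback lock, while an `OSError` raised inside the body propagates to the caller after the held lock is released.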
/bot run
PR_Github #53491 [ run ] triggered by Bot. Commit:
/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
Details
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #53491 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #53818 [ run ] triggered by Bot. Commit:
PR_Github #53818 [ run ] completed with state
Introduced a new helper function to identify lock-infrastructure errors, improving the robustness of the config_file_lock context manager. This change allows for better handling of temporary directory fallbacks during lock acquisition failures, ensuring that exceptions are properly propagated and logged.
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
@coderabbitai summary
Description
Problem
`config_file_lock()` re-raises `filelock.Timeout` instead of using its tempdir fallback. The errno-narrowing added in #11960, `isinstance(e, OSError) and e.errno not in {EACCES, EPERM, ENOLCK, ESTALE}`, was meant to let non-lock `OSError`s propagate. But `filelock.Timeout` is an `OSError` subclass with `errno=None`, so it satisfies that condition and gets re-raised, defeating the very lock-acquisition-timeout fallback the function is supposed to provide.
Impact
When multiple ranks load a `trust_remote_code` model concurrently (tp/ep > 1), they contend on the single global `_remote_code.lock`. The ranks that time out crash during executor init and trigger `MPI_ABORT`; this was observed on the `deepseek_r1_0528_fp4 ... ep:4-tp:4` perf test.
Fix
Refactor `config_file_lock` into a single-yield context manager that guards only the `acquire()` call. The `yield` is moved into `else` + `finally` release, so exceptions raised by the caller body (e.g. HF `RepositoryNotFoundError`, also an `OSError` subclass) propagate cleanly without a second yield. Fallback-eligible failures are now selected via `isinstance`, matching `filelock.Timeout` and `PermissionError` explicitly, plus NFS errnos `ENOLCK`/`ESTALE`.
Verification
Reproduced with a real-`filelock` multi-process contention test (1 holder + 3 workers): the shipped logic makes all waiting workers raise `Timeout`, while the fix lets them all fall back and succeed.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.
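The Problem section of the description can be illustrated with a short sketch. It assumes a local `Timeout` class standing in for `filelock.Timeout` (an `OSError` subclass raised without an errno, as the description states); the helper name `old_condition_reraises` is hypothetical and only wraps the condition quoted from #11960.

```python
import errno


class Timeout(TimeoutError):
    """Local stand-in for filelock.Timeout; TimeoutError is an OSError subclass."""


# Errnos the earlier narrowing treated as lock-related.
LOCK_ERRNOS = {errno.EACCES, errno.EPERM, errno.ENOLCK, errno.ESTALE}


def old_condition_reraises(exc: BaseException) -> bool:
    """The condition quoted in the description: True means the error was re-raised."""
    return isinstance(exc, OSError) and exc.errno not in LOCK_ERRNOS


timeout = Timeout()
print(timeout.errno)                     # None: no errno was supplied
print(old_condition_reraises(timeout))   # True: re-raised, so the fallback is skipped
print(old_condition_reraises(PermissionError(errno.EACCES, "denied")))  # False
```

Because `None` is never a member of the errno set, any errno-less `OSError` subclass takes the re-raise branch, which is why the description moves the selection to explicit `isinstance` checks.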